Developer CD Series 1996 November: Tool Chest

home *** CD-ROM | disk | FTP | other *** search

/ Developer CD Series 1996 November: Tool Chest / Dev.CD Nov 96 TC / Dev.CD Nov 96 TC.toast / Tool Chest / Development Tools & Languages / • Other Platforms / PCCTS 1.31 / Documentation / UPDAT110.txt < prev next >

Wrap

Text File | 1995-03-10 | 59.8 KB | 2,377 lines | [TEXT/MPS ]

The Purdue Compiler Construction Tool Set Version 1.10 Release Notes ANTLR and DLG August 31, 1993 Terence Parr Army High Performance Computing Research Center, University of MN (parrt@acm.org) Will Cohen and Hank Dietz School of Electrical Engineering Purdue University (cohenw@ecn.purdue.edu) (hankd@ecn.purdue.edu) 1. Introduction This document describes the changes/enhancements in PCCTS since version 1.06. As with the 1.06 release notes, these notes do not constitute the complete reference manual. Unfortunately, the reference manual is in the same condition as it was for the 1.00 release in the Spring of 1992. We are working on a total rewrite of the manual, which might end up in a book consisting of the theory behind practical k-token lookahead for k>1 (Terence Parr's Ph.D. thesis), ANTLR implementation notes, and the PCCTS user's manual. The 1.10 release of PCCTS has four main enhancements: fully implemented semantic predicates (<<...>>?), infinite lookahead (plus selective backtracking that uses it), increased ANTLR (see -ck and ZZUSE_MACROS sections) and DLG speed, and the ability to link multiple ANTLR parsers together. A number of bug fixes have been incorporated as well. The tutorials have not been updated much for this release. To better support our user's, we have established a mailing list called pccts-users that you can subscribe to by sending email to pccts-users-request@arc.umn.edu with a body of "subscribe pccts-users your-name". Users that have reg- istered with the PCCTS mail server at pccts@ecn.purdue.edu have not been automatically subscribed. Once you have sub- scribed, posting a message to the PCCTS community is as _________________________ Partial support for this work has come from the Army Research Office contract number DAAL03-89-C-0038 with the Army High Performance Computing Research Center at the U of MN. September 14, 1994 - 2 - simple as sending email to pccts-users@ahpcrc.umn.edu (with any Subject: and body). You can also send a body of help to pccts-users-request@ahpcrc.umn.edu to get help on using the mailing list server. We have finally agreed on a numbering scheme for PCCTS releases: x.yz where x reflects the major release number (new tool additions), y reflects major new feature addi- tions, and z reflects bug fixes and minor feature additions (minor releases). September 14, 1994 - 3 - 2. Semantic Predicates The fundamental idea behind semantic predicates has not changed since the 1.06 release -- semantic predicates indi- cate the semantic validity of continuing with the parse or of predicting a particular production. However, we now col- lect all predicates visible to a syntactically ambiguous parsing decision rather than just the first one encountered as in 1.06. In addition, the context of the predicate can be computed and hoisted with the predicates; helpful warning are also generated for incompletely disambiguated parser decisions. The only backward incompatibilities are that parsing does not halt automatically if a semantic validation predicate fails and the -pr is obsolete -- the specification of a predicate implies that it may be used by ANTLR to vali- date and disambiguate as it sees fit. In this section we discuss all of these issues. 2.1. Visible Semantic Predicates Given a syntactically ambiguous parser decision (or, more accurately, a non-deterministic decision), ANTLR attempts to resolve it with semantic information -- ANTLR searches for visible predicates. A visible predicate is a semantic predicate that could be evaluated without consuming an input token or executing a user action (except initial- ization actions, which generally define variables). All visible predicates reside on the left edge of productions; predicates not on the left edge can only function as valida- tion predicates (see 1.06 release notes). Consider a simple example: a : <<pred1>>? ID | <<pred2>>? ID ; Assuming that lookahead information is insufficient to pre- dict productions of rule a, ANTLR would incorporate pred1 into the prediction expression for the first production and pred2 into the second prediction expression. Multiple pred- icates can be hoisted (which may or may not be what you want): decl: <<pred1>>? var | <<pred2>>? ID ; var : <<is_var(LATEXT(1))>>? ID ; In this case, the prediction expression for production one September 14, 1994 - 4 - of rule decl would resemble (with context computation turned off -- see below): if ( pred1 && is_var(LATEXT(1)) && LA(1)==ID ) { var(); } ... Here, two visible predicates were found to disambiguate the prediction of the first production of rule decl whereas only one was found for the prediction of the second production. The action of evaluating a predicate in a decision is called hoisting. In the first example of this section, predicates were not moved since they reside at the decision point in rule a, but technically we say that they were hoisted into the decision. In the second example, pred1 was hoisted from decl and is_var(LATEXT(1)) was hoisted from rule var to predict the first production of decl. 2.2. Context of Semantic Predicates In release 1.06, predicates were hoisted without com- puting and hoisting the context of that predicate. Context is important because, as we saw in the last section, predi- cates may be evaluated in totally different rules. Imagine a rule that had many alternative productions, two of which were syntactically nondeterministic because of a common lookahead of ID (assuming that only one symbol of lookahead is available for simplicity). a : A ... | B ... | classname | C ... | varname | D ... ; classname : <<pred1>>? ID ; varname : <<pred2>>? ID ; Simply incorporating predi into the production prediction expressions for alternatives three and five is not safe for two reasons: [1] Evaluation of predi may cause a program execution error September 14, 1994 - 5 - if evaluated on the wrong type of data. predi will be evaluated on any input, which is {A, B, C, D, ID} in our case -- predi may "core" if fed non-ID token types. [2] predi may give misleading results even if it does not "core". predi may return false even though the produc- tion is not dependent on the predicate for that token type. For example: a : (var | NUM) ... | <<!is_var(LATEXT(1))>>? ID ... ; var : <<is_var(LATEXT(1))>>? ID ; The first production will never match a NUM because is_var(LATEXT(1)) will always evaluate to false for that token type since numbers are not variables ever (the predicate in var is hoisted for use in the deci- sion for rule a). The way to solve both problems is to change predi to: LA(1)==ID ? predi : 1 The 1 merely indicates that if the lookahead is not an ID then enable the production for normal parsing -- we have no semantic information that establishes the validity or inva- lidity of the production. Context computation similar to this is can now be done automatically (-prc on). Unfortunately, as mentioned previ- ously in this document, computing full LL(k) lookahead is, in general, an exponential problem; hence, for large gram- mars you may want to keep this off with -prc off (default) and include context tests manually in your predicates. The old -pr option to turn on parsing with predicates is now ignored as the specification of a predicate indicates that it should be used. ANTLR does its best to warn the user when a possibly incompletely disambiguated grammar has been specified. In other words, when a syntactically ambiguous decision is resolved with semantic predicates, all mutually ambiguous productions must have at least one semantic predicate asso- ciated with it. For example: September 14, 1994 - 6 - a : <<pred>>? ID ... | ID ... ; This grammar will yield a warning when run through ANTLR with -w2 set because semantic information was not provided to indicate the validity of the second production: t.g, line 1: warning: alt 2 of the rule itself has no predicate to resolve ambiguity However, rule a will behave correctly because if pred fails, the second production will be attempted as the default case (remember that a missing semantic predicate is equivalent to <<1>>?). Adding a third production that began with ID would not behave correctly as the last ID-prefixed production would never be matched. As a more complicated example, consider the following incorrectly disambiguated grammar: a : b | <<pred1>>? ID | <<pred2>>? NUM ; b : <<pred3>>? ID | NUM ; Rule a cannot predict which production to match upon looka- head ID or NUM. Alternatives 2 and 3 have been disam- biguated, but the first production hoists a predicate that only "covers" ID's. As a result, the following message is generated by ANTLR: t.g, line 1: warning: alt 1 of the rule itself has no predicate to resolve ambiguity upon { NUM } This detection is a great help during grammar development. Ambiguity warnings are now turned off for decisions that have semantic predicates covering all ambiguous looka- head sequences. 2.3. Failure of predicates Predicates that are not used in disambiguating parsing September 14, 1994 - 7 - decisions are called validation predicates. Previously, validation predicates that failed during parsing printed out a message and terminated the parser: if ( !pred ) {fprintf(stderr, "failed predicate: 'pred'\n); exit(1);} The latest release of ANTLR generates a call to a macro that the user may define called zzfailed_pred(), which is passed a string representing the predicate that failed: if ( !pred ) {zzfailed_pred("pred");} while this solution is not ideal, it is much better than before. 3. Infinite lookahead and Backtracking There are a number of grammatical constructs that nor- mal LL(k) recursive-descent parsing cannot handle. The most obvious example would be left-recursion, but left-recursion can be removed by well-known algorithms. The nastiest gram- mar construct is one in which two alternative productions cannot be distinguished without examining all or most of the production. While left-factoring can handle many of these cases, some cannot be handled due to things like action placement, non-identical left-factors, or alternatives pro- ductions that cannot be reorganized into the same rule. The solution to the arbitrarily-large common left-factor problem is simply to use arbitrary lookahead; i.e., as much looka- head as necessary to uniquely determine which production to apply. ANTLR 1.10 provides two mechanisms for using "infinite" amounts of lookahead. The first is to use semantic predi- cates in conjunction with a user-defined function that scans arbitrarily ahead using a set of macros provided in this release. The second is a more implicit scheme by which the user can annotate those sections of the grammar, which defy normal LL(k) analysis, with syntactic predicates. ANTLR will then generate code that simply tries out the indicated alternative production to see if it would match a portion of the remaining input. If not, the generated parser would try the next viable alternative production. This scheme is a form of selective backtracking (and, hence, can recognize the class of context free languages) where most of a parser is deterministic and only the "hard" parts are done using trial-and-error. As a direct consequence, ANTLR can now generate parsers with the semantic flexibility of LL(k), that are stronger than full LR(k) (in theory), and are nearly linear in complexity; note that the semantic September 14, 1994 - 8 - predicates (first introduced in the 1.06 release) can take ANTLR-generated parsers beyond the context-free language limit into the context-sensitive. We begin this section by introducing the notion of infinite lookahead through an example problem that we solve with semantic and then with syntactic predicates. Following this, we describe in detail the syntax and use of syntactic predicates, which employ infinite lookahead to perform selective backtracking. 3.1. Examples This section presents a simple grammar whose produc- tions have common left-factors that we assume, for the sake of demonstration purposes, to be non left-factorable. With nothing but the grammar, ANTLR would be unable to construct a deterministic parser. We first provide a solution by writing a function that explicitly accesses the infinite lookahead buffer to determine which production should be attempted. This solution is efficient, but would become somewhat tedious for the programmer if it had to be done for each such problem in a large grammar. Fortunately, an eas- ier and more concise solution is provided by syntactic pred- icates, which we also demonstrate using the same grammar. Consider ML which has multiple assignment and list statements. E.g., stat: list Assign list ";" <<printf("list = list\n");>> | list ";" <<printf("list\n");>> ; This grammar is not LL(k) for any k as list can be arbitrar- ily long. The following grammar using semantic predicates to access the infinite lookahead buffer to explicitly com- pute which production will be matched. September 14, 1994 - 9 - /* example use of the infinite lookahead buffer macros * compile with: * antlr list.g * dlg parser.dlg scan.c * cc -Iantlr_includes -DZZINF_LOOK -o list list.c scan.c err.c */ #header <<#include "charbuf.h">> << main() { ANTLR(stat(), stdin); } /* Scan for a "=", but only before a ";" -- return 1 if found, else 0 This performs the same function as using the syntactic predicate: (list Assign list ";")? but uses a semantic predicate coupled with the infinite-lookahead feature. It is somewhat faster as it does not actually *parse* the "list =", it just scans ahead. MUST HAVE "ZZINF_LOOK" PREPROCESSOR FLAG DEFINED (in #header or on compiler command line) */ which() { int i; for (i=1; ZZINF_LA_VALID(i); i++) { if ( ZZINF_LA(i) == Assign ) return 1; else if ( ZZINF_LA(i) == Semi ) return 0; } return 0; } >> #token "[\ \t]+" <<zzskip();>> #token "\n" <<zzskip(); zzline++;>> #token Assign "=" #token Semi ";" stat: <<which()>>? list Assign list ";" <<printf("list = list\n");>> | list ";" <<printf("list\n");>> ; list: "$" elem ("," elem)* "$" ; elem: ID | INT ; #token ID "[a-zA-Z]+" #token INT "[0-9]+" September 14, 1994 - 10 - The infinite lookahead buffer may be accessed with the fol- lowing macros: ZZINF_LA(i) Return the ith token of lookahead relative to the cur- rent position. Hence, ZZINF_LA(1)..ZZINF_LA(k) are equivalent to LA(1)..LA(k). The difference is that i can range from the current token of lookahead until the last token of lookahead with ZZINF_LA(i). ZZINF_LATEXT(i) Identical to ZZINF_LA(i) except that the text of the ith token is returned. ZZINF_LA_VALID(i) Returns 1 if i if at least i non-EOF tokens are left in the input stream else it returns 0. Naturally, the use of infinite lookahead by defining ZZINF_LOOK is inconsistent with interactive parsers as the entire input stream is read in before parsing begins. As mentioned above, this method could be tedious for large grammars, hence, ANTLR provides a more elegant solu- tion. The same problem can be solved with a syntactic pred- icate by changing rule stat in the following way: stat: ( list Assign list ";" )? <<printf("list = list\n");>> | list ";" <<printf("list\n");>> ; Using this implicit method, the need for the semantic predi- cate and the which() function disappears. Let's now consider a small chunk of the vast C++ decla- ration syntax. Can you tell exactly what type of object f is after having seen the left parenthesis? int f( The answer is "no.". Object f could be an integer initial- ized to some previously defined symbol a: int f(a); or a function prototype or definition: September 14, 1994 - 11 - int f(float a) {...} The following is a greatly simplified grammar for these two declaration types: decl: type ID "$" expr_list "$" ";" | type ID "$" arg_decl_list "$" func_def ; One notices that left-factoring type ID "$" would be triv- ial because our grammar is so small and the left-prefixes are identical. However, if a user action were required before recognition of the reference to rule type, left- factoring would not be possible: decl: <</* dummy init action so next action is not taken as init */>> <<printf("var init\n");>> type ID "\(" expr_list "$" ";" | <<printf("func def\n");>> type ID "$" arg_decl_list "$" func_def ; The solution to the problem involves looking arbitrarily ahead (type could be arbitrarily big, in general) to deter- mine what appears after the left-parenthesis. This problem is easily solved implicitly by using the new (...)? syntac- tic predicate: decl: ( <<;>> <<printf("var init\n");>> type ID "$" expr_list "$" ";" )? | <<printf("func def\n");>> type ID "$" arg_decl_list "$" func_def ; The (...)? says that it is impossible to decide, from the left edge of rule decl with a finite amount of lookahead, which production to predict. Any grammar construct inside a (...)? block is attempted and, if it fails, the next alter- native production that could match the input is attempted. This represents selective backtracking and is similar to allowing ANTLR parsers to guess without being "penalized" for being wrong. Note that the first action of any block is considered an init action and, hence, cannot be disabled (by placing it inside {...}) since it may define variables; the first action of the block is a dummy action. At this point, some readers may argue that scanning ahead arbitrarily far, using the infinite lookahead via a semantic or syntactic predicate, renders the parser non- linear in nature. While this is true, the slowdown is September 14, 1994 - 12 - negligible as the parser is mostly linear. Further, it is better to have a capability that is slightly inefficient than not to have the capability at all. 3.2. Syntactic Predicates Just as semantic predicates indicate when a production is valid, syntactic predicates also indicate when a produc- tion is a candidate for recognition. The difference lies in the type of information used to predict alternative produc- tions. Semantic predicates employ information about the "meaning" of the input (e.g., symbol table information) whereas syntactic predicates employ structural information like normal LL(k) parsing decisions. Syntactic predicates specify a grammatical construct that must be seen on the input stream for a production to be valid. Moreover, this construct may match input streams that are arbitrarily long; normal LL(k) parsers are restricted to using the next k sym- bols of lookahead. This section describes the form and use of syntactic predicates as well as their implementation. 3.2.1. Syntactic Predicate Form Syntactic predictions have the form ( )? or, the shorthand form ( )? which is identical to ( )? where and are arbitrary Extended BNF (EBNF) grammar frag- ments that do not define new nonterminals. The notation is similar to the ()* and ()+ closure blocks already present in PCCTS. The meaning of the long form syntactic predicate is: "If is matched on the input stream, attempt to recognize ." Note the similarity to the semantic predicate: <<>>? which means: "If evaluates to true at parser run-time, attempt to match ." September 14, 1994 - 13 - Decisions, which are nondeterministic (non-LL(k) for finite k), are resolved via (..)? in the following manner: a : 1 | 2 ... | ( i )? i ... | j ... | n ; where productions i and j are mutually nondistinguishable from the left-edge. If production i fails, production j will be attempted. Typically, the number of syntactic pred- icates employed is n-1 where n is the number of mutually nondeterministic productions in a decision; the last produc- tion is attempted by default. When a production to be predicted must be predicted with itself (nothing less sophisticated is sufficient) or when efficiency is not a major concern, the short form is used: a : 1 | 2 ... | ( i )? ... | j ... | n ; 3.2.2. Modified LL(k) Parsing Scheme Decisions that are not augmented with syntactic predi- cates are parsed deterministically with finite lookahead up to depth k as is normal for PCCTS-generated parsers. When at least one syntactic predicate is present in a decision, rule recognition proceeds as follows: [1] Find the first viable production; i.e. the first pro- duction in the alternative list predicted by the cur- rent finite lookahead, according to the associated finite-lookahead prediction-expression. [2] If the first element in that production is not a September 14, 1994 - 14 - syntactic predicate, predict that production and go to [4] else attempt to match its predicting grammar frag- ment. [3] If the grammar fragment is matched, predict the associ- ated production and go to [4] else find the next viable production and go to [2]. [4] Proceed with the normal recognition of the production predicted in [2] or [3]. For successful predicates, both the predicting grammar frag- ment and the remainder of the production are actually matched, hence, the short form, ()?, actually matches twice -- once to predict and once to apply normally. 3.2.3. Syntactic Predicate Placement Syntactic predicates may only appear as the first ele- ment of a production because that is the only place deci- sions are required. For example, the (..)? block in the first production of following grammar has little utility. a : 1 ( )? | 2 ... | n ; There is no question that is to be matched after 1 and try- ing to predict this situation is redundant. Syntactic predicates may appear on the left edge of any production within any subrule, not just in productions at the rule block level. 3.2.4. Nested Syntactic Predicate Invocation Because syntactic predicates may reference any defined nonterminal and because of the recursive nature of grammars, it is possible for the parser to return to a point in the grammar which had already requested backtracking. This nested invocation poses no problem from a theoretical point of view, but can cause unexpected parsing delays in prac- tice. 3.2.5. Grammar Fragments within Syntactic Predicates The grammar fragments within ()? may be any valid PCCTS production right-hand-side; i.e. any expression except new nonterminal definitions. may contain semantic actions and September 14, 1994 - 15 - semantic predicates, although only the semantic predicates will be executed during prediction. 3.2.6. Efficiency In terms of efficiency, the order of alternative pro- ductions in a decision is significant. Productions in a PCCTS grammar are always attempted in the order specified. For example, the parsing strategy outline above indicates that the following rule is most efficient when 1 is less complex than 2. a : (1)? | 2 ; Any parsing decisions made inside a (..)? block are made deterministically unless they themselves are prefixed with syntactic predicates. For example, a : ( (A)+ X | (B)+ X )? | (A)* Y ; specifies that the parser should attempt to match the non- predicated subrule ( (A)+ X | (B)+ X ) using normal the normal finite-lookahead parsing strategy. If a sentence recognizable by this grammar fragment is found on the input stream, then restore the state of the parser to what it was before the predicate invocation and parse the grammar fragment again; else, if the attempt failed, apply the next production in the outer block: (A)* Y 3.2.7. Resolving Ambiguous C++ Statements Quoting from Ellis and Stroustrup ["The Annotated C++ Reference Manual," Margaret A. Ellis and Bjarne Stroustrup, Addison Wesley Publishing Company; Reading, Massachusetts; 1990], September 14, 1994 - 16 - "There is an ambiguity in the grammar involving expression-statements and declarations... The gen- eral cases cannot be resolved without backtrack- ing... In particular, the lookahead needed to dis- ambiguate this case is not limited." The authors use the following examples to make their point: T(*a)->m=7; // expression-statement T(*a)(int); // declaration Clearly, the two types of statements are not distinguishable from the left as an arbitrary amount of symbols may be seen before a decision can be made; here, the -> symbol is the first clue that the first example is a statement. Quoting Ellis and Stroustrup further, "In a parser with backtracking the disambiguating rule can be stated very simply: [1] If it looks like a declaration, it is; otherwise [2] if it looks like an expression, it is; otherwise [3] it is a syntax error." The solution in PCCTS using syntactic predicates is simply: stat: (declaration)? | expression ; The semantics of rule stat are exactly that of the quoted solution. The production declaration will, however, be rec- ognized twice upon a valid declaration and once upon an expression to decide that it is not a declaration. 3.2.8. Revisiting the ML Example To illustrate the utility of the full form of syntactic predicates, reconsider the grammar for the ML-style state- ments provided in the example section above: stat: list "=" list ";" | list ";" ; Rule stat is not LL because list could be arbitrarily long and, hence, predicting which production to apply beforehand is impossible with a finite lookahead depth. There are two solutions in using syntactic predicates, one more efficient September 14, 1994 - 17 - than the other. The first method is, as before, to specify: stat: (list "=" list ";")? | list ";" ; However, this specification unnecessarily matches the list following the assignment operator twice. A more efficient, but functionally equivalent, specification is as follows: stat: (list "=")? list "=" list ";" | list ";" ; This description indicates that, as soon as the "=" has been seen, the first production is uniquely predicted. 3.2.9. Syntactic Predicates Effect on Grammar Analysis ANTLR still constructs normal LL(k) decisions through- out predicated parsers. Only when necessary are arbitrary lookahead predictors used. Constructing LL(k) parsers is an exponential problem that ANTLR goes to great lengths to avoid or reduce in size on average. Unfortunately, for large grammars and k values of more than 2 or 3 ANTLR can take an impractical amount of time. Part of the benefit of (..)? blocks is that, by definition, they defy LL(k) analy- sis. Hence, the exponential, full LL(k) grammar analysis is turned off for any production beginning with a syntactic predicate. In its place, a linear approximation to LL(k) analysis, called LL1(k), is used. This reduces the number of times that arbitrary lookahead (..)? blocks are attempted unnecessarily, though no finite lookahead decision is actu- ally required as the arbitrary lookahead mechanism will accurately predict the production. If the current finite lookahead can predict which pro- duction to apply, syntactic predicates are not evaluated. For example, referring to the C++ declaration versus expres- sion grammar example above, if the current input token were 42, rule stat would immediately attempt the second production -- expression. On the other hand, if the current input token were ID, then the declaration rule would be attempted before attempting expression. If neither produc- tions successfully match the input, a syntax occurs. When constructing finite lookahead sets, the grammar fragment within the (..)? block is ignored. In other words, FIRSTk(()?) is FIRSTk(). September 14, 1994 - 18 - 3.2.10. The Effect of Nondeterminism upon Translation and Semantic Predicates Syntactic predicates are, by definition, not guaranteed to match the current input. Therefore, actions with side- effects, for which no "undo" exists, cannot be executed dur- ing nondeterministic syntactic prediction ("guess" mode). This section describes how ANTLR handles the execution of user-supplied actions and semantic predicates. 3.2.10.1. The Effect upon User Actions PCCTS language specifications do not allow the execu- tion of any semantic action during a syntactic prediction as no undo mechanism exists; this conservative scheme avoids affecting the parser state in an irreversible manner. The only exception to this rule is that initialization actions, which usually define variables visible to the entire rule/function, are not enclosed in if {..} statements to "gate" them out; hence, initialization actions with side effects must be avoided by the PCCTS user. 3.2.10.2. The Effect upon Semantic Predicates Semantic predicates are always evaluated because they are restricted to side-effect-free expressions. During arbitrary lookahead prediction, the semantic predicates that are evaluated must be functions of values computed when actions were turned on. For example, if your grammar has a predicate that examines the symbol table, all symbols needed to direct the parse during prediction must be entered into the table before prediction has begun. Consider the follow- ing grammar fragment which recognizes simplified C declara- tions. decl: "typedef" type declarator ";" /* define new type */ | ( type declarator "\{" )? type declarator func_body /* define function */ | type declarators ";" /* def/decl var(s) */ ; type: built_in_type | <<is_type(LATEXT(1))>>? ID ; declarator : ... /* recognizes a declarator such as "array[3]" */ /* add symbols, both types and vars, to the symbol table */ ; This rule is unnecessarily inefficient, but will illustrate the evaluation of semantic predicates during September 14, 1994 - 19 - nondeterministic prediction. For the purposes of our dis- cussion, we restrict new types to be introduced using a typedef (structures and unions are not allowed). Consider the recognition of the two sentences: typedef int My_int; My_int i; The first production of rule decl will match the first sen- tence, adding My_int to the symbol table as a type name. Production two of decl attempts to match the second sentence with its syntactic predicate. Rule type is entered, which evaluates is_type(LATEXT(1)) (where is_type() is some user- defined function that looks up its symbol argument in the symbol table and returns true if that symbol is defined and is a type). Because the text of the current token of looka- head, My_int, is a valid type, the predicate evaluates to true. Production two of type is applicable semantically and is, therefore, applied. After consuming My_int, the parser successfully applies declarator to i. The next input token is ; which does not match . The nondeterministic prediction fails and production three is predicted by default and is applied. The second production of rule decl could not be rewrit- ten as ( type declarator func_body )? /* define function */ because, presumably, a func_body could define new types. The actions that add these new types to the symbol table would not be executed, however, as the parser would be in nondeterministic mode. Although the semantic predicates would be evaluated correctly, the symbol table would not hold the information necessary to parse the function body during nondeterministic prediction. Also, this revision is very inefficient as it would match the entire function, which could be large, twice. 3.2.11. Comparing the Use of Semantic and Syntactic Predi- cates Language constructs exists that are totally ambiguous syntactically, but easily distinguishable semantically. For example, array references and function calls in Fortran are identical syntactically, but very different semantically. The associated grammatical description is non-LL(k), non- LALR(k), and non-context-free; not even backtracking or infinite lookahead will help this problem. September 14, 1994 - 20 - expratom: ID "$" expr_list "$" | ID "$" expr_list "$" ... ; where expr_list is some rule matching a comma-separated expression list. Putting (..)? around the first alternative production will not change the fact that both productions match the same sentence. However, semantic predicates may be used to semantically disambiguate the rule: expratom: <<isVar(LATEXT(1))>>? ID "$" expr_list "$" | <<isFunc(LATEXT(1))>>? ID "$" expr_list "$" ... ; 3.2.12. Implementation The discussion thus far has described the functionality of syntactic predicates, but their implementation is an equally important topic so that users can understand the new ANTLR parsing mechanism (e.g., so that users can follow along in a debugger while tracking down bugs in their gram- mar). Because productions are assumed to be attempted in the order specified, a nested if-then-else structure is gener- ated. To illustrate the integration of syntactic predicates into the normal ANTLR code generation scheme, consider the following abstract grammar. a : 1 ... | ( i )? i ... | ( j )? j ... | m ; ANTLR generates the following, "slightly sanitized", code: September 14, 1994 - 21 - a() { zzGUESS_BLOCK if ( (1, ..., k) LOOKk(1) ) { match1; } else { zzGUESS if ( !zzrv && (1, ..., k) LOOKk(i) ) { matchi; zzGUESS_DONE matchi; } else { if ( zzguessing ) zzGUESS_DONE; zzGUESS if ( !zzrv && (1, ..., k) LOOKk(j) ) { matchj; zzGUESS_DONE matchj; } else { if ( zzguessing ) zzGUESS_DONE; if ( (1, ..., k) LOOKk(m) ) { matchm; } else goto fail; } } } return; fail: if ( zzguessing ) {zzGUESS_FAIL;} gen syntax error message; resynch parser; } where LOOKk() is the set of lookahead k-tuples that predict . This notation is used as a convenience here whereas ANTLR generates decisions that use as little lookahead as possible in practice. The macros/variables themselves are defined as follows: zzGUESS_BLOCK Define a block of memory to hold the current parser state and the return value of setjmp(), zzrv. zzGUESS Save the current parser state, turn on guessing mode and call setjmp() to record the current C run-time stack state. The result of setjmp() is placed into September 14, 1994 - 22 - zzrv. zzGUESS_FAIL Long jump -- restore the C run-time stack to the state it held before guessing began. zzGUESS_DONE Restore the parser state to the previously saved con- tents. zzguessing This variable is 1 if a prediction is currently under- way and 0 when normal parsing is proceeding. User actions are turned off when this variable is 1. zzrv This variable is the result of doing the setjmp() call, which returns 0 always. When a longjmp() occurs, the C run-time stack will be reset to the state held at the time of the setjmp() and zzrv will be set to a nonzero value. In the view of the C program, it will appear as if the setjmp() has returned without ever having attempted the code in the if following it; execution continues past the if the second time. All semantic actions except initialization actions are enclosed in zzNON_GUESS_MODE { user-defined-semantic-action; } so that they can be "turned off" during a prediction. zzNON_GUESS_MODE is defined as follows: if ( !zzguessing ) The effect of this type of code generation is that a stack of parser states is maintained such that nested nondetermin- istic predictions can be made. As an optimization, when the prediction grammar frag- ment for a production is regular, simpler recognition schemes could be used. 4. DLG Enhancements There have been a number of changes to dlg from 1.06 to 1.10. The main difference is that DLG execution speed is up to 7 times faster than the 1.06 version. A -Wambiguity September 14, 1994 - 23 - option has been added to indicate where ambiguities in DLG specifications exists. It numbers the expressions and prints out for an accept state the possible expressions that could be recognized. Also, a macro called ANTLRs() has been added that behaves just like ANTLR() except that it accepts input from a string rather than a stream: #define ANTLRs(rule, string) {...} 5. Linear-Approximation Lookahead ANTLR-generated parsers predict which rule alternative to match by examining up to k symbols of lookahead. Unfor- tunately, computing (during ANTLR grammar analysis) and examining (during parser execution) the set of possible k- sequences is an exponentially large problem. A linear approximation to this full lookahead exists that requires linear time to compute and to test; further, this approxima- tion handles the majority of parsing lookahead decisions. To avoid the, possibly exponential, computation of full lookahead, ANTLR attempts to use the linear approximation first -- computing full lookahead as a last resort. The reason that ANTLR occasionally goes "off the deep end" when analyzing some big grammars is that ANTLR found a parsing decision that could not be solved with the approximate lookahead and required exponential time to compute the full lookahead. Because the approximation has linear time and space complexity, its lookahead depth can be much deeper than that of the full lookahead. Consequently, the approximate looka- head is sometimes stronger than the full lookahead because it can look farther ahead without consuming an impractical amount of system resources. We have added an ANTLR option, called -ck n, that allows the user to specify how deep the linear approximation analysis should go before giving up and trying the full lookahead computation. This new feature is best described with an example: a : (A B|C D) E | A D F ; The full LL(2) lookahead (as would be computed by "antlr -k 2 ...") is summarized in the following table September 14, 1994 - 24 - +------------------------+ | LL(2) | +----------+-------------+ |Lookahead | Alternative | +----------+-------------+ | A B | 1 | | C D | 1 | | A D | 2 | +----------+-------------+ whereas the linear approximate lookahead, denoted LL1(2) (as would be computed by "antlr -ck 2 ..."), is +--------------------------+ | LL1(2) | +------------+-------------+ | Lookahead | Alternative | +------------+-------------+ |{A,C} {B,D} | 1 | | {A} {D} | 2 | +------------+-------------+ where lookahead {A,C} {B,D} implies that the first symbol of lookahead can be either A or C and the second can be either B or D; this lookahead therefore matches the set of sequences {A B, A D, C B, C D}, which is like the cross product of the sets (note the loss of sequence information, which results in the approximation). The decision is LL(2), but is not LL1(2) because the sequence A D predicts both alternatives (i.e., A can be seen first by both and D can be seen second). However, if ANTLR were allowed to look 3 sym- bols ahead -- LL1(3) -- the linear approximation would be sufficient and the complex LL(3) would not be computed. The LL1(3) information ("antlr -ck 3") is summarized in the fol- lowing table: +------------------------------+ | LL1(3) | +----------------+-------------+ | Lookahead | Alternative | +----------------+-------------+ |{A,C} {B,D} {E} | 1 | | {A} {D} {F} | 2 | +----------------+-------------+ Notice that, now, the third symbol of lookahead alone can uniquely identify which alternative to predict. Let's augment our example to have one LL(2) decision and one LL1(3) decision: September 14, 1994 - 25 - a : (A B | C D) E /* LL1(3) */ | A D F b ; b : (A B | C D) Z /* LL(2) */ | A D Z ; Although LL(3) ("antlr -k 3 ...") handles both the LL1(3) and LL(2) decisions, we can make ANTLR and the resulting parser more efficient by specifying "antlr -ck 3 -k 2 ...". The resulting parser decisions are illustrated in the fol- lowing (sanitized) code fragment: a() { /* there are no sequence comparisons for rule 'a' because LL1(3) * is sufficient and full LL(3) analysis is not invoked */ if ( LA(1){A,C} && LA(2){B,D} && LA(3)==E ) { if ( (LA(1)==A) ) { zzmatch(A); zzCONSUME; zzmatch(B); zzCONSUME; } else if ( (LA(1)==C) ) { zzmatch(C); zzCONSUME; zzmatch(D); zzCONSUME; } zzmatch(E); zzCONSUME; } else if ( (LA(1)==A) && (LA(2)==D) && (LA(3)==F) ) { zzmatch(A); zzCONSUME; zzmatch(D); zzCONSUME; zzmatch(F); zzCONSUME; b(); } } September 14, 1994 - 26 - b() { /* LL(2) decision */ if ( (LA(1)==A&&LA(2)==B) || (LA(1)==C&&LA(2)==D) ) { if ( (LA(1)==A) ) { zzmatch(A); zzCONSUME; zzmatch(B); zzCONSUME; } else if ( (LA(1)==C) ) { zzmatch(C); zzCONSUME; zzmatch(D); zzCONSUME; } } else if ( LA(1)==A&&LA(2)==D ) { zzmatch(A); zzCONSUME; zzmatch(D); zzCONSUME; zzmatch(Z); zzCONSUME; } } These examples are small and, hence, the savings are not apparent, but the "compressed" approximation to full lookahead can be used to reduce the ANTLR execution time and resulting parser speed/size. 6. Faster Compilation of ANTLR-Generated Parsers Previous versions of ANTLR used macros rather than function calls for many operations during parsing. Because the macros were invoked numerous times, compilation of these files was slow and generated large object files. The opera- tions are now, by default, function calls which makes compi- lation about 2 times as fast and results in object files about half as large. The macros can be used if necessary by defining ZZUSE_MACROS on the compile line (-DZZUSE_MACROS). 7. Linking Together Multiple ANTLR Parsers Because of the lack of sophisticated "information hid- ing" in C, many ANTLR program symbols are globally visible and, hence, linking multiple ANTLR-generated parsers together would cause many name collisions. To overcome this, we have introduced a new ANTLR directive: #parser "my_parser_name" which prefixes all global, externally visible symbols with prefix my_parser_name_ (remember this when debugging ANTLR parsers). This, clearly, renders the previous "generate prefix" option (-gp) obsolete. Variables, functions and September 14, 1994 - 27 - rule names are remapped through the inclusion of a file called remap.h, which is automatically generated by ANTLR when it encounters a #parser directive. In the future, we expect this to be the name of a C++ object of some class Parser; variables and functions will be referenced as my_parser_name.var_or_func. Consider the following ANTLR example. Files t.g and t2.g are identical except for the parser name. File t.g #header <<#include "charbuf.h">> #parser "t" << void parse_t() { ANTLR(a(), stdin); } >> #token "[\ \t\n]" <<zzskip();>> a : INT INT ; #token INT "[0-9]+" File t2.g #header <<#include "charbuf.h">> #parser "t2" << void parse_t2() { ANTLR(a(), stdin); } >> #token "[\ \t\n]" <<zzskip();>> a : INT INT ; #token INT "[0-9]+" September 14, 1994 - 28 - File main.c #include <stdio.h> extern void parse_ter(); extern void parse_ter2(); main() { parse_ter(); parse_ter2(); } File makefile September 14, 1994 - 29 - DLG_FILE = parser.dlg ERR_FILE = err.c HDR_FILE = stdpccts.h TOK_FILE = tokens.h K = 1 ANTLR_H = ../h BIN = ../bin ANTLR = ../bin/antlr DLG = $(BIN)/dlg CFLAGS = -I. -I$(ANTLR_H) -g AFLAGS = -fe err.c -fl parser.dlg -ft tokens.h -fr remap.h -fm mode.h \ -gt -gk AFLAGS2= -fe t2_err.c -fl t2_parser.dlg -ft t2_tokens.h -fr t2_remap.h \ -fm t2_mode.h -gt -gk DFLAGS = -C2 -i GRM = t.g SRC1 = scan.c t.c err.c SRC2 = t2.c t2_scan.c t2_err.c main.c OBJ1 = scan.o t.o err.o OBJ2 = t2.o t2_scan.o t2_err.o main.o CC=g++ t: $(OBJ1) $(OBJ2) $(CC) -o t $(CFLAGS) $(OBJ1) $(OBJ2) t.o : mode.h tokens.h t.g scan.c mode.h : parser.dlg $(DLG) $(DFLAGS) parser.dlg scan.c t.c parser.dlg tokens.h : t.g $(ANTLR) $(AFLAGS) t.g t2.o : t2_mode.h t2_tokens.h t2.g t2_scan.c t2_mode.h : t2_parser.dlg $(DLG) $(DFLAGS) -m t2_mode.h t2_parser.dlg t2_scan.c t2.c t2_parser.dlg t2_tokens.h : t2.g $(ANTLR) $(AFLAGS2) t2.g The input to the parser is 4 integers because each of the invoked parsers matches 2. The preprocessor symbol zzparser is set the parser name string specified in the #parser "name" directive. WARNING: the remapping of symbols to avoid collisions is not a foolproof system. For example, if you have a rule named type and a field in a structure named type, the field name will get renamed as well -- this is the price you pay for being able to link things together without C++. September 14, 1994 - 30 - 8. Creating Customized Syntax Error Routines Many users have asked how to create their own zzsyn() error handling routine. Here's how: [1] Make new zzsyn() with same parameters. [2] Define the preprocessor symbol USER_ZZSYN on the com- pile line (-DUSER_ZZSYN). 9. Lexical Changes to ANTLR Input The manner in which ANTLR interprets user actions has changed. Strings, character literals, and C/C++ comments are now totally ignored. For example, << // nothing in here is examined $1 ' " /* or in here ' " $ #[jfd] '"''''" */ '"' // that's a character "'" // that's an apostrophe "$1 is", $1 /* $1 inside string is ignored */ >> As a result of this change, you may experience a slight dif- ference in how ANTLR treats your actions. Comments inside actions are still passed through to the parser. C++ comments are now accepted outside of actions as well: // this rule does nothing a : ; Watch out for this: ... << // a comment >> ... a : A ; The C++ style comment in side the action will scarf til end of line and ignore the >> end action symbol. This could be avoided, but I'm feeling lazy just now. September 14, 1994 - 31 - 10. New ANTLR Options Release 1.10 introduces the following ANTLR command- line options: - ANTLR now accepts input from stdin by using the - option; e.g., antlr - A file called stdin.c is created as the output parser. -ck nUse up to n symbols of lookahead when using compressed (linear approximation) lookahead. This type of looka- head is very cheap to compute and is attempted before full LL(k) lookahead, which is of exponential complex- ity in the worst case. In general, the compressed lookahead can be much deeper (e.g, -ck 10) than the full lookahead (which usually must be less than 4). -fm mode_file Rename file with lexical mode definitions, mode.h, to file. -fr file Rename file which remaps globally visible symbols, remap.h, to file. This file is only created if a #parser directive is found. -prc on Turn on the computation and hoisting of predicate con- text. -prc off Turn off the computation and hoisting of predicate con- text. This option makes 1.10 behave like the 1.06 release with option -pr on (default). -w1 Set low warning level. Do not warn if semantic predi- cates and/or (...)? blocks are assumed to cover ambigu- ous alternatives. -w2 Ambiguous parsing decisions yield warnings even if semantic predicates or (...)? blocks are used. Warn if -prc on and some lookahead sequences are not disam- biguated with a hoisted predicate. 11. ANTLR Generates "Androgynous" Code The distribution source of 1.06 PCCTS was generated using 1.06 on a 32-bit machine. Unfortunately, the source code dumped bit sets to arrays of unsigned's according to September 14, 1994 - 32 - the word size of the machine that generated the parser -- regardless of the word size of the various target machines. To overcome this, ANTLR always dumps its bit sets as arrays of unsigned char, which are 8 bits (or more) on any machine that we'd ever want to work on. As a result, ANTLR itself should bootstrap on any machine with a C compiler a enough memory. We have gotten it to compile with 16-bit Microsoft and Borland C on the PC with only a few whimpers. The make- files in the ANTLR and DLG directories have sections for each of the various compilers. 12. Printing out grammars Using the -p option generates grammar listings that are somewhat nicer. 13. C Grammar Changes The C grammar example has been augmented with a -both option that prints out both K&R and ANSI C prototypes for functions defined in the input file. E.g. % proto -both void f(a,b) int a; char *b; {;} ^D void #ifdef __STDC__ f( int a, char *b ) #else f( a, b ) int a; char *b; #endif Functions that already employ ANSI C style argument defini- tions are handled as well. 14. C++ Now Compiles ANTLR Itself We have modified the source code of ANTLR to compile under C++. It is not written to take advantage of C++'s extensions to C, however, except in rare instances. C++'s stricter type checking motivated the modification. 15. New Preprocessor Symbol ANTLR now generates a #define called ANTLR_VERSION that is set to the version of ANTLR that generated the parser. For this release, you will see: September 14, 1994 - 33 - #define ANTLR_VERSION 110 in the output files, which is an integer equivalent of the version number. 16. Attribute Warning A number of users have had trouble with the charptr.h attributes. Please note that they do not make copies and that the memory is freed after the scope exits. For exam- ple, this is wrong because the memory for the $1 attribute of A or B in the (...) scope will be freed upon exit even though $0 will still point to it. #header<<#include "charptr.h">> << #include "charptr.c" main() { ANTLR(a(),stdin); } >> #token "[\ \t\n]" << zzskip(); >> a: "ick" ("A" << $0=$1; >>| "B" << $0=$1; >>) "ugh" << printf("$1, $2, $3 are %s, %s, %s\n",$1, $2, $3); >> ; One should make a copy of the local attribute (or use char- buf.h) as the mem is freed at the end of the scope ($0=strdup($1);). 17. Generation of Line Information The normal form of line information is: # line_number "file" However, many compilers, such as Borland C, prefer it as #line line_number "file" This can be easily changed by looking file generic.h in the antlr directory for the following: September 14, 1994 - 34 - /* User may redefine how line information looks */ #define LineInfoFormatStr "# %d \"%s\"\n" Simply change it as your compiler wants it and recompile the antlr source. 18. Incompatibilities There should be very few incompatibilities with your 1.06-based grammars. Should you find any please let us know. 1.06 semantic predicates were not hoisted into parsing decisions without the -pr flag (now obsolete). In 1.10, the use of a predicate indicates that it may be hoisted. Semantic predicates used to halt parser upon failure whereas 1.10 does not. The interpretation of strings, character literals, and comments are now handled differently; see above. 19. Future Directions This section briefly describes some of the future enhancements either being discussed, planned, or developed. o A graphical user interface is planned for ANTLR gram- mars that will allow the simultaneous dis- play/manipulation of BNF and syntax diagram representa- tions of user grammars. o A source-to-source translator-generator called SORCERER is in prototype form. It's input looks like ANTLR and is integrated so that one description will contain lex- ical, syntactic, and tree-translation information. o A number of groups are working on a C++ grammar. Things are starting to heat up as it is pretty much certain that 1.10 ANTLR is the minimum necessary system to parse C++. o A code-generator generator, called PIGG, is in proto- type form. o An assembler generator is in prototype form. o DLG backtracking will be added. o A new "magic" token type, ".", will be introduced which means "match any single token." September 14, 1994 - 35 - o A new operator will be introduced, "~", which will allow constructs like ~(A|B|C) -- implying "match a single token not from the set {A, B, C}." o A new ANTLR directive will be introduced: #tokclass name { token_list } that creates a set of tokens like the #errclass direc- tive, but one which can be referenced in the grammar. For example: #tokclass AOP { "-" "+" } #tokclass MOP { "/" " } #tokclass OP { AOP MOP } ... e : e1 ( AOP e1 )* ; e1: e2 ( MOP e2 )* ; ... o Simple left-factoring will be introduced to remove identical left factors from alternative productions (assuming user actions do not interfere). o A version of ANTLR (called ANTLR-lite?) is being con- sidered that would accept most ANTLR description syn- tax, delay grammar analysis to run time (where it could be done much more quickly -- with nonexponential com- plexity), scan for tokens with an NFA regular- expression interpreter rather a DFA, and place all out- put in one nice little file. The reduction in parser size would be substantial, but at a parser run-time cost. 20. Portability PCCTS 1.10 is known to compile "out of the box" on the following machines and/or operating systems: [1] DECSTATION 5000 [2] SGI, Running IRIX 4.0.5 [3] Sun SparcStation (cc, gcc, g++, Cfront) [4] DOS and OS/2, Microsft C 6.0, 16 bit [5] DOS and OS/2, Borland C/C++, 16 bit [6] OS/2, IBM C-Set/2, 32 bit September 14, 1994 - 36 - [7] VAX C under VMS [8] Linux 0.99, gcc/g++ [9] NeXT box [10] Amiga, AmigaDOS--SAS/C Development System V 5.10b 21. Beta Testers The following is a group of persons (listed alphabeti- cally) that, in some way, have helped shape and/or debug the latest release of PCCTS. [1] Steven Anderson, (sea@ahpcrc.umn.edu) [2] Douglas B. Cuthbertson, (cuthbert- sond@gw1.hanscom.af.mil) [3] Peter Dahl, (dahl@mckinley.ee.umn.edu) [4] Ed Harfmann, (mdbs!ed@dynamo.ecn.purdue.edu) [5] Randy Helzerman, (helz@ecn.purdue.edu) [6] Stephen Hite, (shite@sinkhole.unf.edu) [7] Dana "Muck" Hoggatt, (mdbs!muck@dynamo.ecn.purdue.edu) [8] Roy Levow, (roy@gemini.cse.fau.edu) [9] John Mejia, (mejia@mckinley.ee.umn.edu) [10] David Poole, (dpoole@nitrogen.oscs.montana.edu) [11] Russell Quong, (quong@ecn.purdue.edu) [12] Aaron Sawdey, (sawdey@mckinley.ee.umn.edu) [13] Fred Scholldorf, (scholl- dorf@nuclear.physics.sunysb.edu) [14] Sumana Srinivasan, (Sumana_Srinivasan@next.com) [15] Ariel Tamches, (tamches@cs.wisc.edu) September 14, 1994